Conversation
Yes, the below approach will be a intended/correct approach.
python-chebai-graph/chebai_graph/preprocessing/properties/base.py Lines 156 to 171 in 63216b4 |
…on-chebai-graph into feature/sdf-support
…ith custom method in fg rules
|
I have created a regular dataset which has updated the tokens.txt files. I also changed the However, I get an error when trying to load the data. when trying to execute this statement (try-catch block added for debugging): It looks like the number of nodes in the dataset (here: 24) is not the same as the number of nodes for which |
|
I also changed the linting to using only ruff (mirroring the main repository: ChEB-AI/python-chebai#145) |
There was a problem hiding this comment.
Pull request overview
Adds SDF-based dataset support across the graph preprocessing pipeline by accepting RDKit Chem.Mol inputs, reusing augmented molecule structures for property computation, and standardizing molecule sanitization.
Changes:
- Extend readers/datasets to accept
str | Chem.Molinputs and propagate augmented molecule dictionaries toread_property. - Standardize RDKit SMILES parsing via
sanitize=False+ shared_sanitize_molecule. - Update property token bins / model config edge dimensions and migrate linting to
pre-commit+ Ruff.
Reviewed changes
Copilot reviewed 15 out of 19 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| configs/model/res_aug_amgpool.yml | Updates edge_dim to reflect additional bond feature(s). |
| configs/model/res_aug_aapool.yml | Updates edge_dim to reflect additional bond feature(s). |
| configs/model/gat_aug_amgpool.yml | Updates edge_dim to reflect additional bond feature(s). |
| configs/model/gat_aug_aapool.yml | Updates edge_dim to reflect additional bond feature(s). |
| chebai_graph/preprocessing/reader/reader.py | Adds RDKit-mol support and sanitization to readers; changes _read_data return shape for property reader. |
| chebai_graph/preprocessing/reader/augmented_reader.py | Allows Chem.Mol inputs and returns augmented molecule dict alongside GeomData; updates property reading to accept dict. |
| chebai_graph/preprocessing/property_encoder.py | Makes IndexEncoder.cache robustly mutable when adding new tokens. |
| chebai_graph/preprocessing/properties/properties.py | Adds validation around mol and SMILES conversion before descriptor generation. |
| chebai_graph/preprocessing/properties/base.py | Formatting-only assertion message restructuring. |
| chebai_graph/preprocessing/fg_detection/fg_aware_rule_based.py | Uses shared sanitization when rebuilding fragment mols from SMILES. |
| chebai_graph/preprocessing/datasets/chebi.py | Avoids redundant augmentation/property recomputation; supports tuple-returning _read_data for downstream property reading. |
| chebai_graph/preprocessing/bin/NumAtomBonds/indices_one_hot.txt | Extends token set for NumAtomBonds. |
| chebai_graph/preprocessing/bin/BondType/indices_one_hot.txt | Extends token set for BondType (adds UNSPECIFIED). |
| chebai_graph/preprocessing/bin/AtomNumHs/indices_one_hot.txt | Extends token set for AtomNumHs. |
| chebai_graph/preprocessing/bin/AtomFunctionalGroup/indices_one_hot.txt | Extends token set for AtomFunctionalGroup. |
| chebai_graph/models/dynamic_gni.py | Formatting-only assertion message restructuring. |
| .pre-commit-config.yaml | Migrates lint/format hooks to Ruff + updates hook set. |
| .github/workflows/pre-commit.yml | Adds CI workflow to run pre-commit. |
| .github/workflows/lint.yml | Removes prior standalone lint workflow. |
Comments suppressed due to low confidence (1)
chebai_graph/preprocessing/reader/reader.py:170
GraphReader._read_datanow calls_smiles_to_mol()which returns an RDKitChem.Mol, but the method still assertsisinstance(mol, nx.Graph)and then iteratesmol.nodes/mol.edges. This will fail at runtime for every input. Either restore the previouspysmiles/NetworkX parsing path, or refactor_read_datato build the NetworkX graph from the RDKit molecule (and adjust types/docs accordingly).
# raw_data is a SMILES string
try:
mol = (
self._smiles_to_mol(raw_data) if isinstance(raw_data, str) else raw_data
)
except ValueError:
return None
assert isinstance(mol, nx.Graph)
d: dict[int, int] = {}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # augment molecule graph if possible (this would also happen for the properties if needed, but this avoids redundancy) | ||
| if isinstance(self.reader, _AugmentorReader): | ||
| returned_results = [] | ||
| for mol in features: | ||
| try: | ||
| r = self.reader._create_augmented_graph(mol) | ||
| except Exception: | ||
| r = None | ||
| returned_results.append(r) | ||
| mols = [ | ||
| augmented_mol[1] if augmented_mol is not None else None | ||
| for augmented_mol in returned_results | ||
| ] |
There was a problem hiding this comment.
In _setup_properties, when self.reader is an _AugmentorReader you call self.reader._create_augmented_graph(mol) for each entry in features. For existing SMILES-based datasets features is typically a str, but _create_augmented_graph expects an RDKit Chem.Mol and will fail (caught and turned into None), resulting in mols being all None and property extraction being skipped. Consider using self.reader.read_property(...) directly, or convert SMILES to Chem.Mol via the reader’s _smiles_to_mol before calling _create_augmented_graph.
| if isinstance(row["features"], tuple): | ||
| geom_data, _ = row[ | ||
| "features" | ||
| ] # ignore additional returned data from _read_data (e.g. augmented molecule dict) | ||
| else: | ||
| geom_data = row["features"] |
There was a problem hiding this comment.
GraphPropertiesMixIn._merge_props_into_base now supports row["features"] being a tuple, but other mixins in this same module (e.g., the augmented-graph property mixins) still treat row["features"] as a GeomData and read masks from it. If the processed dataset stores (GeomData, augmented_molecule) (as _AugmentorReader._read_data now returns), those asserts/mask accesses will break at load time. Consider normalizing row["features"] to GeomData earlier (or updating the remaining call sites to unwrap tuples consistently).
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| dataset = ChEBI25_WFGE_WGN_AsPerNodeType(chebi_version=248, subset="3_STAR") | ||
| dataset.prepare_data() | ||
| dataset.setup() |
There was a problem hiding this comment.
This file appears to include an executable __main__ block that runs dataset preparation/setup. For a library/module file this is easy to trigger accidentally and can lead to long-running side effects when users run the module directly. Consider removing it from the PR, or moving this into a dedicated CLI/script (or a reproducible notebook) if it’s meant as a debugging entrypoint.
| if __name__ == "__main__": | |
| dataset = ChEBI25_WFGE_WGN_AsPerNodeType(chebi_version=248, subset="3_STAR") | |
| dataset.prepare_data() | |
| dataset.setup() |
| - uses: actions/checkout@v4 | ||
|
|
||
| - uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: '3.10' | ||
|
|
||
| - name: Run pre-commit | ||
| uses: pre-commit/action@v3.0.1 |
There was a problem hiding this comment.
The workflow YAML is invalid due to indentation: steps: must contain a properly-indented list (the - uses: entries should be indented under steps:). As written, GitHub Actions will fail to parse this workflow.
| - uses: actions/checkout@v4 | |
| - uses: actions/setup-python@v5 | |
| with: | |
| python-version: '3.10' | |
| - name: Run pre-commit | |
| uses: pre-commit/action@v3.0.1 | |
| - uses: actions/checkout@v4 | |
| - uses: actions/setup-python@v5 | |
| with: | |
| python-version: '3.10' | |
| - name: Run pre-commit | |
| uses: pre-commit/action@v3.0.1 |
| hooks: | ||
| - id: black | ||
| - id: black-jupyter # for formatting jupyter-notebook | ||
| # Use `pre-commit autoupdate` to update all the hook. |
There was a problem hiding this comment.
Minor grammar: "update all the hook" should be "update all the hooks".
| # Use `pre-commit autoupdate` to update all the hook. | |
| # Use `pre-commit autoupdate` to update all the hooks. |
| raw_data (str | Chem.Mol): SMILES string or RDKit molecule object. | ||
|
|
||
| Returns: | ||
| GeomData | None: Graph data object or None if molecule parsing failed. |
There was a problem hiding this comment.
The _read_data docstring still documents returning GeomData | None, but the function signature and implementation now return a tuple (GeomData, mol). Please update the docstring return type/description to match the new behavior to avoid misleading callers.
| GeomData | None: Graph data object or None if molecule parsing failed. | |
| tuple[GeomData, Chem.Mol] | None: A tuple of (graph data object, RDKit molecule) | |
| or None if molecule parsing failed. |
| try: | ||
| frag_mol = _sanitize_molecule(frag_mol) | ||
| except Exception: | ||
| pass | ||
| frag_rings = frag_mol.GetRingInfo().AtomRings() |
There was a problem hiding this comment.
frag_mol can be None when Chem.MolFromSmiles(frag, sanitize=False) fails, or it can become None if sanitization fails. In both cases the next line calls frag_mol.GetRingInfo(), which will raise an AttributeError. Add a None check (and decide whether to mark is_ring_fg as False or skip the fragment) before accessing ring info.
| try: | |
| frag_mol = _sanitize_molecule(frag_mol) | |
| except Exception: | |
| pass | |
| frag_rings = frag_mol.GetRingInfo().AtomRings() | |
| if frag_mol is not None: | |
| try: | |
| frag_mol = _sanitize_molecule(frag_mol) | |
| except Exception: | |
| # If sanitization fails, fall back to the original fragment molecule | |
| pass | |
| # If molecule creation or sanitization failed, treat as having no rings | |
| if frag_mol is None: | |
| frag_rings = () | |
| else: | |
| try: | |
| frag_rings = frag_mol.GetRingInfo().AtomRings() | |
| except Exception: | |
| frag_rings = () |
| if isinstance(result["features"], tuple): | ||
| result["features"], raw_data = result["features"] | ||
| if result is None or result["features"] is None: | ||
| return None |
There was a problem hiding this comment.
_preprocess_smiles_for_pred checks isinstance(result["features"], tuple) before verifying that result is not None. Since the next line explicitly handles result is None, this can currently raise a TypeError/AttributeError when to_data(...) returns None. Move the result is None / result["features"] is None guard before dereferencing result["features"].
| if isinstance(result["features"], tuple): | |
| result["features"], raw_data = result["features"] | |
| if result is None or result["features"] is None: | |
| return None | |
| if result is None or result["features"] is None: | |
| return None | |
| if isinstance(result["features"], tuple): | |
| result["features"], raw_data = result["features"] |
Add support for the SDF-based dataset (ChEB-AI/python-chebai#147). This includes mostly:
Chem.Molobjects as input to_read_datafunctionsChem.Molas input toread_property(also: the_read_datafunction now returns the augmented molecule dictionary which gets passed toread_property, avoiding a complete recalculation for each property)sanitize_moleculefunction to ensure consistent SMILES parsingOne issue I came across while testing this: The
AugAtomNumHsproperty (or any property inheriting theFrozenPropertyAlias) doesn't allow new tokens to be created. @aditya0by0 How should I add new tokens here? Should I first build a dataset with the non-augmented version of each property and then use those to created an augmented dataset?